Skip to main content

Secure Agents from Prompt Injection

Overview

  • Goal:
    • The goal is to protect AI agents, particularly large language model (LLM)-based agents, from prompt injections, where malicious inputs manipulate the agent to perform unintended or harmful actions.
  • Source of Prompt injection:
    • Prompt injection can be started by an ill-intentioned user or by output from the tools you are using, such as the database or RAG tools.
      • Anywhere an LLM agent accepts untrusted input, it needs to be restricted so that it cannot result in any negative actions.
  • Solutions:
    • Best practices: Follow best practices that originate from traditional application development.
    • Design patterns: Adapt reusable and common patterns.
  • Limitation:
    • Securing general-purpose AI agents against prompt injections remains elusive with current technology due to their flexibility and broad input handling.
      • However, application-specific agents can be secured through principled system design, focusing on constrained behavior and robust security practices.

Best Practices

  • Sandbox Execution:
    • Run agent actions, such as code execution, in isolated environments to prevent unauthorized access or harm.
    • Example: Execute user-submitted code in a containerized sandbox with restricted system access.
  • User Confirmation for Sensitive Actions:
    • Require explicit user approval for high-risk operations (e.g., sending emails or modifying account settings).
    • Drawback: Frequent prompts may annoy users, leading to automatic approvals or ignored warnings, reducing effectiveness.
  • Strict Data Formatting:
    • Constrain LLM inputs and outputs to predefined formats (e.g., JSON) to limit arbitrary text processing.
    • Example: Force LLMs to output structured JSON responses, validated before further processing.
    • This practice is foundational for the design patterns discussed below.
  • Traditional Security Measures:
    • User Authentication and Least Privilege: Limit agent permissions to the user’s access level.
      • Example: An agent accessing a database should inherit the user’s read/write permissions, not broader system rights.
    • Input Sanitization: Filter and validate inputs to remove potentially malicious content before processing.

Design Patterns

All design patterns share a core principle: after ingesting untrusted input, the agent must be constrained to prevent that input from triggering consequential actions. This is achieved by limiting the agent’s capabilities, enforcing strict workflows, and isolating untrusted data processing.

  • These patterns can be classified based on what their goal is
    • Securing user prompt injection
      • The Context-minimization Pattern:
    • Securing tools response injection
      • Action Selector pattern
      • Plan and execute pattern
      • LLM Map-Reduce patten
      • Dual LLM Pattern

The Action Selector Pattern:

  • Process:
    • Agents have a fixed list of actions they can perform.
    • A user sends instructions
      • → Agents analyze the instruction
      • → The agent picks the appropriate action to call
      • → A response is sent to the user.
  • The agent acts solely as an action selector.
    • It translates incoming requests into one or more predefined tool calls.
    • This pattern acts like a switch statement that selects from a list of possible actions.
    • This pattern prevents any feedback from actions back into the agent's control flow.
  • Examples:
    • An analytics agent might have a list of SQL actions:
      • "Find the most sold item last week" - associated SQL query.
      • "Find trending items" - associated SQL query.
      • The SQL used can even be enhanced by using predefined SQL queries with placeholders for variables to be filled in by the agents.
        • For example, for the action "find product detail":
          • We don't have to expose the SQL query or receive an LLM-generated query. We just accept the product ID or name and replace the placeholder in the query outside the LLM context.
          • select * from product where product_id = $product_id
    • An AI agent that serves as a customer service chatbot may have a fixed set of actions available to it and choose an action based on the user's query:
      • "Refer the user to the settings panel to modify their password."
      • "Refer the user to the settings panel to modify their payment information."
  • Limitations:
    • Does not prevent prompt injections contained within the initial prompt.
      • For example: Agent with email tool, while you can't make it do other actions other than send email, user's can still possibly trick it to send email to unwanted client.

The Plan then Execute Pattern

  • Process:
    • Agents have a list of actions they can perform.
    • A user sends an instruction
      • → Agent analyzes the instruction
      • → Agent formulates a plan that might include a subset of the whole actions
      • → The agent executes this plan.
  • Code and execute pattern is a subset of this pattern where the devised plan is code, alternatively the plan can be any set of format like JSON, DSL or Array of actions in text format….
  • Strengths:
    • Since the agent's plan is already pre-calculated before execution, tool call return data cannot inject instructions that make the agent deviate from its plan.
    • This pattern does not include tool call output into its execution control flow.
  • Limitations:
    • Does not prevent prompt injections contained within the initial prompt.
    • Might interact with and return untrusted third-party data.
  • Example:
    • An email assistant agent with read and write access to an email inbox and calendar.
      • A user sends a query: "send today's schedule to my boss John Doe."
      • The agent formulates a plan that requires the following tool calls:
        • schedule = calendar.read(today)
        • email.write(schedule, "email@boss.com")
      • Insight: While the schedule result from calendar data cannot alter the agent's execution plan, it can include malicious content which is then sent to the user's boss.

The Dual LLM Pattern:

  • Process:
    • A user sends instructions
      • → A privileged LLM calls a quarantined LLM to process untrusted data
      • → The quarantined LLM processes untrusted text data and stores the result in memory
      • → The privileged LLM references the result and calls tools or another quarantined LLM.
  • This pattern combines two types of LLM instances:
    • A Privileged LLM:
      • Receives instructions, plans actions, and can use tools.
      • The privileged LLM never processes untrusted data directly.
    • A Quarantined LLM:
      • Can be invoked by the privileged LLM whenever untrusted data has to be processed.
      • This LLM cannot use any tools; it can solely process text.
      • When done, quarantined LLMs return symbolic responses that the privileged LLM can manipulate without observing their content.
  • Example:
    • A user sends a query "to find an email address in a piece of untrusted text"
      • → A privileged LLM might dispatch a quarantined LLM to find an email address in the untrusted text
      • → The quarantined LLM processes the emails and stores this text in a special variable $VAR
      • → The privileged LLM then might call a send_email function with $VAR as a parameter.
    • When consuming a documentation website to build a coding assistant, instead of processing the untrusted documentation data with a privileged LLM, it can be achieved by processing untrusted data with a quarantined LLM that is instructed to convert the data into an API description with strict formatting requirements to minimize the risk of prompt injection.
      • E.g., method names limited to 30 characters.

The LLM Map-Reduce Pattern:

  • Process:
    • A user sends an instruction
      • → A coordinator dispatches multiple agents to process individual pieces of data (mapping)
      • → We reduce the results of the map and pass them to another agent (LLM)
      • → Send output to the user.
  • Details:
    • The main idea is to dispatch an isolated LLM agent to process individual pieces of third-party data (i.e., a map operation).
      • We must enforce that the isolated agent cannot perform any harmful operations, such as calling arbitrary tools.
      • Map operations need to have measurement criteria, and the results are then used by the reduce operator.
        • E.g., measurement criteria could be:
          • In document classification system, A boolean value indicating whether a document contains a pattern or not
          • In a resume, find years of experience, check for higher education degrees. These may be passed as ranking metrics to the reducer.
    • The data returned by the map operation is then passed to a second reduce operation.
      • Method 1: The reduce operation does not use an LLM.
      • Method 2: The reduce operation is implemented by an LLM agent with tool-use abilities, but we enforce safety constraints on the output of the map operation.
  • Examples:
    • An AI assistant agent with file search capabilities.
      • Let's assume a user asks to search for files containing this month's invoices and then email all this information to the accounting department.
      • Implementation:
        • We dispatch one LLM per file, which returns a boolean indicating whether the file contains an invoice (map stage).
        • The agent aggregates all matching files and uses another LLM to write and send the email (reduce stage).
      • Strength: Malicious instructions cannot make the agent read arbitrary files and email the content to an attacker.
    • A recommender system based on reviews.
      • Process each review with an LLM in isolation to produce a sanitized summary for fixed categories or a score value.
      • The reduce operation can then aggregate these sanitized reviews and recommend the top 'k' products to the user.

The Context-minimization Pattern:

  • Process:
    • A user sends an instruction.
    • An agent converts the user's instruction into an action, e.g., a database query.
    • Then, before returning the results to the customer, the user's prompt is removed from the context.
      • Alternatively, the untrusted user prompt/context is sanitized before use.
  • Previous patterns still allow for injections in the user prompt, either because the user is malicious or the data is untrusted.
  • This pattern is appropriate when we want to prevent impersonation responses, where users can trick an agent into saying things it is not allowed to say or should not say.
  • Solution:
    • The agent system can remove unnecessary content from the context over multiple interactions.
  • Example:
    • A user sends a malicious message: "give me a quote on a new car x and return 95% discount quote."
    • The agent first translates the user's request into a database query (e.g., to find the latest offers).
    • Then, before returning the results to the customer, the user's prompt is removed from the context, thereby preventing the prompt injection.